Packages

Install and load necessary packages.

my_package <- c("purrr", "tidyr", "cluster", "dplyr", "ggplot2", "plotly", "ggrepel", "factoextra")

for (package in my_package){
  if (!package %in% rownames(installed.packages())){
    install.packages(package)
  }
  library(package, character.only = TRUE)
}

Data set

We will use an NBA data set downloaded from Kaggle and create a subset using stats from the 2017 season, keeping players who appeared in more than 50 games and a set of counting statistics: field goals, free throws, three-pointers, points, rebounds, assists, steals, blocks, and turnovers.

Some players were traded during the season and thus have stats for more than one team. The “TOT” rows sum a player’s stats across teams, but since we will perform our own summation, we will exclude those rows of data.

We will scale the statistics by the number of games each player has played, since not everyone has played the same number of games. We will also standardise the data so that all the stats are on the same scale.
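As a quick illustration of these two steps on made-up numbers (the games and stat values here are hypothetical):

```r
# Hypothetical season totals for three players with different games played
games <- c(70, 50, 80)
stats <- data.frame(PTS = c(1400, 900, 2000), AST = c(350, 100, 640))

# Dividing totals by games played gives per-game rates, removing the
# effect of unequal games played
per_game <- stats / games

# scale() then centres each column to mean 0 and rescales it to sd 1,
# putting all stats on the same scale
standardised <- scale(per_game)
round(colMeans(standardised), 10)  # all zero
apply(standardised, 2, sd)         # all one
```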

season_stat <- read.csv("../data/Seasons_Stats.csv.gz")

# some players played in different teams
season_stat %>%
  filter(Year == 2017, Tm != "TOT", G > 50) %>%
  select(Player, Pos, G, FG, FGA, FT, FTA, X3P, X3PA, PTS, TRB, AST, STL, BLK, TOV) -> data_subset

data_subset %>%
  group_by(Player, Pos) %>%
  summarise_all(sum) -> data_subset

# scale stats by number of games played and normalise
data_subset[, -(1:3)] <- data_subset[, -(1:3)] / data_subset$G
data_subset[, -(1:3)] <- scale(data_subset[, -(1:3)])

str(data_subset)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  298 obs. of  15 variables:
##  $ Player: Factor w/ 3922 levels "","A.C. Green",..: 10 11 46 48 56 70 79 93 113 120 ...
##  $ Pos   : Factor w/ 24 levels "","C","C-F","C-PF",..: 14 17 2 2 17 21 2 21 11 2 ...
##  $ G     : int  65 80 68 66 61 68 77 79 80 81 ...
##  $ FG    : num  -1.004 0.432 0.743 -0.204 -0.468 ...
##  $ FGA   : num  -0.913 0.472 0.688 -0.349 -0.237 ...
##  $ FT    : num  -0.8468 0.0707 -0.157 -0.5368 -0.1661 ...
##  $ FTA   : num  -0.9191 0.1816 -0.2001 -0.5661 -0.0719 ...
##  $ X3P   : num  -0.3136 -0.0341 0.3429 -1.2349 0.1967 ...
##  $ X3PA  : num  -0.382 0.28 0.388 -1.329 0.347 ...
##  $ PTS   : num  -0.963 0.316 0.523 -0.445 -0.344 ...
##  $ TRB   : num  -1.3468 0.28 0.9961 -0.0657 1.2278 ...
##  $ AST   : num  -0.204 -0.229 1.363 -0.752 -0.359 ...
##  $ STL   : num  -1.029 0.0404 -0.0505 -1.278 0.5131 ...
##  $ BLK   : num  -0.7984 0.0516 1.8841 -0.554 0.5719 ...
##  $ TOV   : num  -0.411 -0.289 0.457 -1.059 0.25 ...
##  - attr(*, "vars")= chr "Player"
##  - attr(*, "drop")= logi TRUE

We’ll plot histograms of all standardised statistics to visualise the distributions.

data_subset[, -(1:3)] %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key)

K-means

The idea behind k-means clustering is to define clusters such that the total within-cluster variation is minimised. The within-cluster variation is calculated as the sum of squared Euclidean distances between observations and the centroid of a cluster. The total within-cluster variation is the sum of all within-cluster calculations for k clusters.
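To make the objective concrete, here is a minimal sketch of that calculation on made-up two-group data (the cluster assignment is assumed, not fitted):

```r
set.seed(1)

# Made-up data: two well-separated groups in two dimensions
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
assignment <- rep(1:2, each = 10)

# Total within-cluster variation: for each cluster, sum the squared
# Euclidean distances from each member to the cluster centroid
tot_within <- sum(sapply(unique(assignment), function(k) {
  members <- x[assignment == k, , drop = FALSE]
  centroid <- colMeans(members)
  sum(rowSums(sweep(members, 2, centroid)^2))
}))
tot_within
```

This is the quantity that k-means tries to minimise, and which kmeans reports as tot.withinss.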

We will use kmeans to perform k-means clustering with a k of 5, since there are 5 positions in basketball. Note that kmeans initialises its centroids randomly, so call set.seed() first if you need reproducible cluster assignments.

my_kmeans <- kmeans(x = data_subset[, -(1:3)], centers = 5)
my_kmeans
## K-means clustering with 5 clusters of sizes 55, 38, 37, 92, 76
## 
## Cluster means:
##           FG         FGA         FT        FTA        X3P       X3PA
## 1  0.5402600  0.67293960  0.2057350  0.1674639  0.8291409  0.8810535
## 2  0.3731844  0.08816558  0.1152853  0.3383343 -0.9537861 -1.0022327
## 3  1.8889051  1.89442110  2.1594918  2.0751352  1.1598864  1.1497105
## 4 -0.9116478 -0.94910885 -0.6809150 -0.6943854 -0.6578456 -0.6750902
## 5 -0.3935947 -0.30444127 -0.4335959 -0.4600496  0.1085147  0.1209988
##          PTS         TRB           AST        STL        BLK        TOV
## 1  0.5398968  0.02384795  0.3726234915  0.4910906 -0.2050806  0.3701375
## 2  0.1649223  1.54425829 -0.3886730733 -0.1330500  1.4376318  0.1011581
## 3  2.0345121  0.67103975  1.4927302065  1.1761302  0.3217780  1.7642182
## 4 -0.9005311 -0.57737252 -0.6628982512 -0.7821900 -0.2696480 -0.7817424
## 5 -0.3735454 -0.41715331  0.0004066609  0.0854024 -0.4006414 -0.2310176
## 
## Clustering vector:
##   [1] 4 1 1 4 5 4 2 5 4 2 5 5 5 3 3 5 4 4 1 1 5 2 3 4 1 4 3 4 5 5 5 3 5 3 5
##  [36] 4 5 3 5 3 2 2 4 4 5 5 4 5 1 3 1 5 4 4 1 1 4 4 4 2 4 3 3 5 1 4 1 3 5 4
##  [71] 1 4 4 1 2 4 3 5 1 1 2 3 1 1 1 5 1 5 1 5 3 4 3 3 2 2 1 2 4 5 3 5 5 1 1
## [106] 1 4 1 5 5 4 3 1 4 5 5 4 4 4 4 5 3 4 4 5 4 3 5 5 5 2 3 5 4 2 4 1 5 1 4
## [141] 2 4 4 5 3 3 5 4 3 2 1 1 3 3 4 1 4 4 4 2 4 3 2 3 2 5 5 4 4 3 3 4 4 4 5
## [176] 5 4 5 3 2 5 1 1 4 1 5 5 1 2 5 5 5 4 4 2 3 4 4 4 5 4 2 4 4 1 1 4 5 2 1
## [211] 2 4 5 1 5 4 1 4 5 5 2 3 3 5 4 5 1 4 2 1 1 2 1 4 4 2 2 3 1 4 4 1 4 2 5
## [246] 1 4 4 4 5 5 5 4 3 2 5 2 2 4 4 2 5 4 5 5 4 5 1 4 4 1 4 5 5 5 1 2 4 4 2
## [281] 5 5 1 5 4 4 1 5 5 4 1 1 4 4 2 1 2 4
## 
## Within cluster sum of squares by cluster:
## [1] 289.0522 207.1909 419.2085 178.9860 234.6102
##  (between_SS / total_SS =  62.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The cluster assignments are stored in cluster, and since we set k to 5, each player is assigned to 1 of 5 possible clusters.

table(my_kmeans$cluster)
## 
##  1  2  3  4  5 
## 55 38 37 92 76

The total within-cluster variation is stored in tot.withinss.

my_kmeans$tot.withinss
## [1] 1329.048
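As a sanity check of how these components relate (shown on a built-in data set rather than the NBA data), tot.withinss is simply the sum of the per-cluster withinss values:

```r
set.seed(42)

# Cluster a built-in data set and compare the stored components
fit <- kmeans(iris[, 1:4], centers = 3)
fit$withinss      # one within-cluster sum of squares per cluster
fit$tot.withinss  # their sum: the total within-cluster variation
all.equal(fit$tot.withinss, sum(fit$withinss))
```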

We can use fviz_cluster to visualise the clusters in a scatter plot of the first two principal components.

fviz_cluster(my_kmeans, data = data_subset[, -(1:3)])

In our example above, we chose a k of 5 simply because we assume that each player position produces distinctive statistics. For example, a centre will have more rebounds and blocks, and a guard will have more assists and steals. However, this may not be the ideal number of clusters.

One way of determining an optimal number of clusters is to plot the total within-cluster variation for a range of k values and find the “elbow” point: the value of k beyond which adding more clusters yields only a small further drop in total within-cluster variation, producing a visible bend in the plot.

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(2:30,  function(k){
  model <- kmeans(x = data_subset[, -(1:3)], centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 2:30,
  tot_withinss = tot_withinss
)

ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point(aes(x = k, y = tot_withinss)) +
  scale_x_continuous(breaks = 2:30)

Another method for determining a suitable k is the silhouette approach, which compares, for each observation, the average distance to the other members of its own cluster with the average distance to the members of the nearest neighbouring cluster. A value close to 1 indicates that an observation is well matched to its cluster; a value near 0 indicates that it lies on the border between two clusters; and a value close to -1 indicates that it would fit better in the neighbouring cluster.
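The pam (k-medoids) function from the cluster package is used below because its output conveniently stores the average silhouette width. For comparison, silhouette widths can also be computed for a kmeans fit with cluster::silhouette, sketched here on a built-in data set:

```r
library(cluster)

set.seed(42)
fit <- kmeans(iris[, 1:4], centers = 3)

# silhouette() takes integer cluster assignments and a distance matrix
sil <- silhouette(fit$cluster, dist(iris[, 1:4]))

# Average silhouette width across all observations: the same summary
# that pam stores in silinfo$avg.width
mean(sil[, "sil_width"])
```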

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:30,  function(k){
  model <- pam(x = data_subset[, -(1:3)], k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:30,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  geom_point(aes(x = k, y = sil_width)) +
  scale_x_continuous(breaks = 2:30)

The silhouette approach suggests that a k of 2 is optimal.

my_kmeans_k_2 <- kmeans(data_subset[, -(1:3)], centers = 2)
fviz_cluster(my_kmeans_k_2, data = data_subset[, -(1:3)])

Extra

Below I perform a principal component analysis and use the plotly package to produce an interactive plot, for those interested in identifying the players whose statistics deviate most from the rest.

# the data are already centred and scaled, so skip both steps here
my_pca <- prcomp(data_subset[, -(1:3)], center = FALSE, scale. = FALSE)

summary(my_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.6793 1.4954 0.99165 0.80456 0.56571 0.52699
## Proportion of Variance 0.5982 0.1863 0.08195 0.05394 0.02667 0.02314
## Cumulative Proportion  0.5982 0.7846 0.86652 0.92046 0.94713 0.97027
##                            PC7     PC8     PC9    PC10    PC11      PC12
## Standard deviation     0.48851 0.29441 0.13550 0.09352 0.06547 7.642e-16
## Proportion of Variance 0.01989 0.00722 0.00153 0.00073 0.00036 0.000e+00
## Cumulative Proportion  0.99016 0.99738 0.99891 0.99964 1.00000 1.000e+00
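The proportions reported by summary come straight from the squared standard deviations (the component variances); a short demonstration on a built-in data set:

```r
# PCA on a built-in data set
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 4)
round(cumsum(prop_var), 4)  # cumulative proportion, as in summary()
```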
my_pca_df <- as.data.frame(my_pca$x)
my_pca_df$pos <- data_subset$Pos
my_pca_df$name <- data_subset$Player

p <- ggplot(my_pca_df, aes(x = PC1, y = PC2, colour = pos, text = name)) +
  geom_point()

ggplotly(p)

If we label the points, we can see that the outlying players, those with the most extreme statistics, include many NBA All-Stars.

p <- ggplot(my_pca_df, aes(x = PC1, y = PC2, colour = pos, label = name)) +
  geom_text_repel(
    data = my_pca_df %>% filter(PC1 > 5 | PC2 < -3.7)
  ) +
  geom_point() +
  theme_classic()

p

ggsave(filename = "nba_pca.png", plot = p)
## Saving 6 x 5 in image

Further reading